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Abstract 



We develop a theory for the temporal integration of visual motion motivated by 
psychophysical experiments. The theory proposes that input data are temporally 
grouped and used to predict and estimate the motion flows in the image sequence. 
This temporal grouping can be considered a generalization of the data association 
techniques used by engineers to study motion sequences. Our temporal-grouping 
theory is expressed in terms of the Bayesian generalization of standard Kalman 
filtering. To implement the theory we derive a parallel network which shares some 
properties of cortical networks. Computer simulations of this network demonstrate 
that our theory qualitatively accounts for psychophysical experiments on motion 
occlusion and motion outliers. 
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1 Introduction 



Local motion signals are often ambiguous and many important motion phenomena 

can be explained by hypothesising that the human visual system uses temporal 

coherence to resolve ambiguous inputs^ For example, the perceptual tendency to 

disambiguate the trajectory of an ambiguous motion path by using the time average 

of its past motion was first reported by Anstis and Ramachandran (Ramachandran 

and Anstis 1983; Anstis and Ramachandran 1987). They called this phenomenon 

motion inertia. Other motion phenomena involving temporal coherence include the 

improvement of velocity estimation over time (McKee et al. 1986, Ascher 1998), blur 

removal (Burr et al. 1986), motion-outlier detection (Watamaniuk et al. 1994), and 

motion occlusion (Watamaniuk and McKee 1995). These motion phenomena pose 

a serious challenge to motion-perception models. For example, in motion-outlier 

detection the target dot is indistinguishable from the background noise if one only 

observes a few frames of the motion. Therefore, the only way to detect the target 

dot is by means of its extended motion over time. Consequently, explanations 

based solely on single, large local-motion detectors as in the Motion-Energy and 

Elaborated-Reichardt models (Hassenstein and Reichardt 1956; van Santen and 

Sperling 1984; Adelson and Bergen 1985; Watson and Ahumada 1985; for a review 

of motion models, see Nakayama 1985) seem inadequate to explain these phenomena 

(Verghese et al. 1999). Also, it has been argued (Yuille and Grzywacz 1998) that 

all these experiments could be interpreted in terms of temporal grouping of motion 

signals involving prediction, observation and estimation. The detection of targets, 

and their temporal grouping, could be achieved by verifying that the observations 

were consistent with the motion predictions. Conversely, failure in these predictions 

would indicate that the observations were due to noise, or distractors, which could 

be ignored. The ability of a system to predict the target's motions thus represents 

1 Temporal coherence is the assumption that motion in natural images is mostly temporally 
smooth, that is, it rarely changes direction and/or speed abruptly. 
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a powerful mechanism to extract targets from background distractors and might be 
a powerful cue for the interpretation of visual scenes. 

There has been little theoretical work on temporal coherence. Grzywacz et al. 
(1989) developed a theoretical model which could account for motion-inertia phe- 
nomena by requiring that the direction of motion varies slowly over time while 
speed could vary considerably faster. More recently Grzywacz et al. (1995) pro- 
posed a biologically plausible model that extends the work on motion inertia and 
allow for the detection of motion outliers. It was observed (Grzywacz et al 1989) 
that for general three-dimensional motion, the direction, but not the speed, of the 
image motion is more likely to be constant and hence might be more reliable for 
motion prediction. This is consistent with the psychophysical experiments on mo- 
tion inertia (Anstis and Ramachandran, 1987) and temporal smoothing of motion 
signals (Gottsdanker 1956; Snowden and Braddick 1989; Werkhoven et al. 1992) 
which demonstrated that the direction of motion is the primary cue in temporal 
motion coherence. Other reasons which might make motion direction more reliable 
than speed are the unreliability of local-speed measurement due to the poor tem- 
poral resolution of local motion units (Kulikowski and Tolhurst, 1973, Holub and 
Morton-Gibson 1981) and to involuntary eye movements. 

The aim of this paper is to present a new theory for visual-motion estimation and 
prediction that exploits temporal information. This theory builds on our previous 
work (Grzywacz et al. 1989; Grzywacz et al. 1995; Yuille et al. 1998) and on 
recent work on tracking of object boundaries (Isard and Blake 1996). In addition, 
this paper gives a concrete example of the abstract arguments on temporal grouping 
proposed by Yuille and Grzywacz (1998). To this end, we use a Bayesian formulation 
of estimation over time (Ho and Lee 1964), which allows us simultaneously to make 
predictions over time, update those predictions using new data, and reject data that 
are inconsistent with the prediction^} This rejection of data allows the theory to 
2 We will be updating the marginal distributions of velocity at each image point rather than 
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implement basic temporal grouping (for speculations about more complex temporal 
grouping, see Yuille and Grzywacz 1998). This basic form of temporal grouping 
is related to data association as studied by engineers (Bar-Shalom and Fortmann 
1988) but differs by being probabilistic (following Ho and Lee 1964). Finally, we 
show that our temporal grouping theory can be implemented using locally connected 
networks, which are suggestive of cortical structures. 

The paper is organized as follows: The theoretical framework for motion pre- 
diction and estimation is introduced in Section [2} Section [3] gives a description of 
the specific model chosen to implement the theory in the continuous time domain. 
Implementation of the model using a locally connected network is addressed in Sec- 
tion [4] and Section [5} The model's predictions are tested in Section |6]by performing 
computer simulations on the motion phenomena described previously. Finally, we 
discuss relevant issues, such as the possible relationship of the model to biology, in 
Section [3 

2 The Theoretical Framework 
2.1 Background 

The theory of stochastic processes gives us a mathematical framework for modelling 
motion signals that vary over time. These processes can be used to predict the 
probabilities of future states of the system (e.g. the future velocities) and are 
called Markov if the probability of future states depends only on their present state 
(i.e. not on the time history of how they arrived at this state). In computer vision, 
Markov processes have been used to model temporal coherence for applications such 
as the optimal fusing of data from multiple frames of measurements (Matthies et 
al. 1989; Clark and Yuille 1990; Chin et al. 1994). Such optimal fusing is typically 
defined in terms of least-squares estimates which reduces to Kalman filtering theory 
the full probability distribution of the entire velocity field, which would require spatial coupling. 
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(Kalman 1960) . Because Kalman niters are (recursive) linear estimators that apply 
only to Gaussian densities, their applicability in complex scenes involving several 
moving objects is questionable. In this paper we propose a Bayesian formulation 
of temporal coherence which is generalization of standard Kalman filters and which 
can deal with targets moving in complex backgrounds. 

We begin by deriving a generalization of Kalman filters which is suitable for 
describing the temporal coherence of simple motion^] Consider the state vector x k 
describing the status of a system at time t k (in our theory x k denotes the velocity 
field at every point in the image). Our knowledge about how the system might 
evolve to time k + 1 is described in a probability distribution function p(x k+ i\x k ), 
known as the "prior" (in our theory this prior will encode the temporal coherence 
assumption). The measurement process that relates the observation z k of the state 
vector to its "true" state is described by the likelihood distribution function p(z k \x k ) 
(in our theory z k will be the responses of basic motion units at every place in the 
image). From these two distribution functions, it is straightforward to derive the 
a posteriori distribution function p(x k+ i\Z k+ i), which is the distribution of x k+ i 
given the whole past set of measurements Z k+1 = (z , • ■ • , z k+1 ) (note that although 
the prediction of the future depends only on the current state the estimate of the 
current state does depend on the entire past history of the measurements). Using 
Bayes' rule and a few algebraic manipulations (Ho and Lee 1964), we get 



where p(x k+ i\Z k ) is prediction for the future state x k +i given the current set of past 
measurements Z k , and p(z k+ i\Z k ) is a normalizing term denoting the confidence in 
the measure z k+ i given the set of past measurements Z k . The predictive distribution 
function can be expressed as 




(1) 




(2) 



3 Our development follows the work of Ho and Lee 1964. 
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Equations [TJ [2] are generalizations of the Wiener-Kalman solution for linear 
estimation in presence of noise. Its evaluation involves two stages. In the prediction 
stage, given by Equation [2j the probability distribution p(x k+ i\Z k ) for the state at 
time k + 1 is determined. This function, which involves the present state and the 
Markov transition describing the dynamics of the system, has a variance larger than 
that of p(xk\Zk). This increase in the variance results from the non-deterministic 
nature of the motion. In the measurement stage, performed by Equation [TJ the new 
measurements Zk+i are combined using Bayes' theorem and, if consistent, reinforce 
the prediction and decrease the uncertainty about the new state. (Inconsistent 
measurements may increase the uncertainty still further). 

It was shown (Ho and Lee 1964) that if the measurement and prior probabilities 
are both Gaussians, then Equations [T] and [2] reduce to the standard Kalman equa- 
tions which update the mean and the covariance of and p(xk+\\Zk) 
over time. Gaussian distributions, however, are non-robust (Huber 1981) and an 
incorrect (outlier) measurement can seriously distort the estimate of the true state. 
Standard linear Kalman filter models are therefore not able to account for the psy- 
chophysical data which demonstrate, for example, that human observers are able 
to track targets despite the presence of inconsistent (outlier) measurements (Wata- 
maniuk et al. 1994, Watamaniuk and McKee 1995). Various techniques, known 
collectively as data association (Bar-Shalom and Fortmann 1988), can be applied to 
reduce these distortions by using an additional stage which decides whether to reject 
or accept the new measurements. From the Bayesian perspective, this extra stage is 
unnecessary and robustness can be ensured by correct choices of the measurement 
and prior probability models. More specifically, we specify a measurement model 
which is robust against outliers. 
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2.2 Prediction and Estimation Equations for a Flow Field 

We intend to estimate the motion flow field, defined over the two-dimensional image 
array, at each time step. We describe the flow field as {v(x, t)}, where the parenthe- 
ses {•} denote variations over each possible spatial position x in the image array. The 
prior model for the motion field and the likelihood for the velocity measurements 
are described by the probability distribution functions P({v(x,t)}\{v(x,t — 5)}) 
and P({(f(x,t)}\{v(x,t)}) (see Section[i]for details), where 5 is time increment and 
{<f)(x,t)} represents the response of the local motion measurement units over the 
image array. Our theory requires that the local velocity measurements are normal- 
ized so that J2fj,<P(x,t) = 1 for all x and at all times t (see Section [3] for details). 
We let $(£) = ({4>(x, t)}, {<j)(x, t — S)}, - - -) be the set of all measurements up to and 
including time t, and P({v(x, t — S)}\$(t — S)) be the system's estimated probability 
distribution of the velocity field at time t — S. Using Equations [T] and [2] described 
above, we get 

P{{mt)}m) = P({fet) } |U7( 5 ,t) } )P(Kf,t) } | $ (t-.) ); 

P({(f>{x,t)}\$(t - S)) 

for the estimation, and 

p({v(x,t)}m-s)) = 

P({v(x,t)}\{v(x,t- 5)})P({v(x,t- S)}\$(t - 5))d[{v(x,t- 5)}], (4) 



for the prediction. 

2.3 The Marginalization Approximation 

For simplicity and computational convenience, we assume that the prior and the 
measurement distributions are chosen to be factorizable in spatial position, so that 
the probabilities at one spatial point are independent of those at another. This 
restriction prevents us from including spatial coherence effects which are known to 
be important aspects of motion perception (see, for example, Yuille and Grzywacz 



1988). For the types of stimuli we are considering these spatial effects are of little 
importance and can be ignored. 

In mathematical terms, this spatial-factorization assumption means that we can 
factorize the prior P p and the likelihood Pi so that: 

P P ({v(x,t)}\{v(x',t- 5)}) = l[ Pp {v(x,t)\{v(x',t- 5)}) 

x 

Pl({$(£,t)}\{*(iB,t)}) =Y[ Pl ($(x,t)\v(x,t)}). (5) 

X 

A further restriction is to modify Equation [4] so that it updates the marginal 
distributions at each point independently. This again reduces spatial coherence be- 
cause it decouples the estimates of velocity at each point in space. Once again, we 
argue that this approximation is valid for the class of stimuli we are considering. 
This approximation will decrease computation time by allowing us to update the 
predictions of the velocity distributions at each point independently. The assump- 
tion that the measurement model is spatially factorizable means that the estimation 
stage, Equation [3j can also be performed independently. 

This gives update rules for prediction p pre d- 

Vpred {v{x,t)\®{t-5)) = J[dv(x*,t-5)}P p (v T (x,t)\{v(x',t-8)}) 

P e ({v(x',t-5)}mt-6)). (6) 

and for estimation P e : 

prr t)mt)) = Pi($(x } t)\v(x } t))P wed (v{x,t)\^{t - 5)) 

P($(x,t)\<S>(t - 6)) 

In what follows we will consider a specific likelihood and prior function which 
allows us to simplify these equations and define probability distributions on the 
velocities at each point. This will lead to a formulation for motion prediction and 
estimation where computation at each spatial position can be performed in parallel. 
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3 The Model 



We now describe a specific model for the likelihood and prior functions which, as 
we will show, can account for many psychophysical experiments involving temporal 
motion coherence. Based on this specific prior function, we then derive a formulation 
for motion prediction in the continuous time domain. 

3.1 Likelihood Function 

The likelihood function gives a probabilistic interpretation of the measurements 
given a specific motion. It is probabilistic because the measurement is always cor- 
rupted by noise at the input stage. (In many vision theories the likelihood function 
depends on the image formation process and involves physical constraints such as, 
for instance, the geometry and surface reflectance. Because we are considering psy- 
chophysical stimuli, consisting of moving dots, we can ignore such complications). 

To determine our likelihood function, we must first specify the input stage. Ide- 
ally, this would involve modeling the behaviors of the cortical cells sensing natural 
image motion, but this would be too complex to be practical. Instead, we use 
a simplified model of a bank of receptive fields tuned to various velocities v^Xjt), 
/i — 1 • • • M, and positioned at each image pixel. These cells have observation activ- 
ities {(f>i(x,t)} which are intended to represent the output of a neuronally plausible 
motion model such as (Grzywacz and Yuille 1990). (In our implementation, these 
simple model cells receive contributions from the motion of dots falling within a 
local neighbourhood, typically the four nearest neighbours. Intuitively, the closer 
the dot is to the center of the receptive field, and the closer its velocity is to the 
preferred velocity of the cell, then the larger the response. The spatial profile and 
velocity tuning curve of these receptive fields are described by Gaussian functions 
whose covariance matrices S m:a; and S m:1 , depend on the direction of v^(x,t), and 
are specified in terms of their longitudinal and transverse components crj^ £ , (Jm-.x,T 
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and cr^. vT . For more details, see section 4.1 of (Yuille, Burgi, Grzywacz 

1997)). 

The likelihood function specifies the probability of the receptive field responses 
conditioned on a "true" external motion field. We assume that the measurements 
depend only on the velocity field at that specific position. We can therefore write 
Pi({${x,t)}\{v(x,t)})=n,Pi{${Z,t)\v(Z,t)). We now specify: 

where Pj is the joint probability distribution and we set #j to bfl 

^i(0l(x,t),0 2 (f,t), ■ ■ ■ ,<j>M(x,t),Vn(x,t)) = (j>(x,t) ■ f(Vp(x,t)), (9) 
with tuning curves / given by: 

e -(l/2)(v t -v l _ l ) T T,- 1 (v l -v ll ,) 
fi{Vp{x,t)) = ^Af ie _ (1/2)(flfj _^ ) T S -l (fl r j ._^ )> ( 10 ) 

where £ is the covariance matrix which depends on the direction of V/j,(x,t), and is 

specified in terms of its longitudinal and transverse components o~f. v L and af. vT ). 

The experimental data (Anstis and Ramachandran 1987, Gottsdanker 1956, Werkhoven 

et al. 1992) suggest that temporal integration occurs for velocity direction rather 

than for speed. This is built into our model by choosing the covariances so that 

the variance is bigger in the direction of motion than in the perpendicular direction 

(i.e. the velocity component perpendicular to the motion has mean zero and very 

small variance so the direction of motion is fairly accurate, but the variance of the 

velocity component along the direction of motion is bigger which means that the 

estimation of the speed is not accurate). Observe that we have assumed that the 

response of the measurement device is instantaneous. It would be possible to adapt 

the likelihood function to allow for a time lag but we have not pursued this. Such 

a model might be needed to account for motion blurring. 

4 This is only one of several possible choices. It is attractive, see (Yuille, Burgi, Grzywacz 1997) 
because it leads to a simple linear update rule. 

10 



3.2 Prior Function 



We now turn to the prior function for position and velocity. For a dot moving at 
approximately constant velocity, the state transitions for position and velocity is 
given respectively by x(t + 5) = x(t) + 5v(t) + co x (t), and v(t + 5) = v(t) + u v (t), 
where u x (t) and £u v (t) are random variables representing uncertainty in the position 
and velocity. Extending this model to apply to the flow of dots in an image requires 
a conditional probability distribution P m (v(x, t)\{v^(xi,t — 5)}), where the set of 
spatial positions and the set of allowed velocities are discretized with the spatial 
positions set to be {Si : % = 1, • • • , N}, and the velocities at af, to {v^ : \i = 
1,---,M}. We are assuming that P m ({v(x, t)}\.) = H x P m (v(x,t)\.) so that the 
velocities are predicted independent of each other. We also assume that the prior 
is built out of two components: (i) the probability p(x\xi, v^(xi, t — 5)) that a dot 
at position X{ with velocity at time t — 5 will be present at x at time t, and 
(ii) the probability P(v(x, t — 6)) that it will have velocity v(x,t). These 

components are combined to give: 



P m (v(x, 6)}) 



J2 Yj P{v{x, t)\v^Xi, t - 5)) p(x\x i ,v ll (x i , t-S)), (11) 



where K(x,t) is a normalization factor chosen to ensure that 
P m (v(x, t)\{vn(xi, t — 5)}) is normalized. The two conditional probability distri- 



bution functions in Equation [TT] are assumed to be normally distributed, and thus, 

-(l/2)(x~x i ~8v fl (x i ,t-5)) T 'E~ (x-Xi-Sv^Xifi-S)) 



p(x\x h v^(xi,t - 5)) 
P(v(x, t)\Vfj,{x h t- 5)) 



~ e 



-(l/2)(v(x,t)-v^(x i ,t-8)) T Z- 1 (v(x,t)-v ti (x l ,t-8)) 



(12) 
(13) 



These functions express the belief that the dots are, on average, moving along 



a straight trajectory (Equation 12) with constant velocity (Equation 13). The 



covariances and £„, which quantify the statistical deviations from the model, are 
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defined in a coordinate system based on the velocity vector v^. These matrices are 
diagonal in this coordinate system and are expressed in terms of their longitudinal 
and transverse components a 2 x L , a 2 T , a 2 u L , a 2 T . 

3.3 Continuous Prediction 

Computer simulations of Equation [6] (even for the simple case of moving dots) 
requires large-kernel convolutions which require excessive computational time. In- 
stead, we re-express Equation|6]in the continuous time domain. (Such a re-expression 
is not always possible and depends on the specific choices of probability distribu- 
tions. Note that a similar approach has been applied for the computation of stochas- 
tic completion fields by Williams and Jacobs 1997a,b). As shown in the Appendix 
for Gaussian priors, the evolution of P(v(x,t)\$>(t — 5)) for 5 — > satisfies a vari- 
ant of a partial differential equation, known as Kolmogorov's forward equation or 
Fokker-Planck equation. Our equation is a non-standard variant because, unlike 
standard Kolmogorov theory, our theory involves probability distributions at every 
point in space which interact over time. It is given by: 

-v- -^P(v(x,t)) + J vP(v(x,t))dv. (14) 

where \V\ is the volume of velocity space / dv, and {|^S^J|} and {|^S^J|} 
are scalar differential operators (for isotropic diffusions, these scalar operators are 
Laplacian operators). The terms on the right hand side of this equation stand for 
velocity diffusion (first term), spatial diffusion (second term) with deterministic drift 
(third term), and normalization (fourth term). This equation is local and describes 
the spatio-temporal evolution of the probability distributions P(v(x,t)). 
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4 The Implementation 

We now address the implementation of the update rule for motion estimation (Equa- 
tions [7|). At each spatial position, two separate cells' populations are assumed, one 
for coding the likelihood function and another for coding motion estimation. If we 
were to choose the large- kernel convolution for motion prediction (Equation [6]), then 
the network implementation would reduce to two positive linear networks multiply- 
ing each other followed by a normalization, illustrated in Figure [I] and see (Yuille et 
al 1998). But a problem with implementing this network on serial computers is that 
the range of the connections between motion-estimation cells depends on the mag- 
nitude of Vfj, (that is, we would need long-range connections for high speed tunings). 
Such long range connections occur in biological systems, like the visual cortex, but 
are not well suited to VLSI or serial computer implementations. Alternatively, us- 



ing our differential equation for predicting motion (Equation 14 ) involves only local 
connections. We now focus on this particular implementation (see Figure [2]). 

Kolmogorov's equation can be solved on an arbitrary lattice using a Taylor series 
expansion. The spatial term, which contains the diffusion and drift terms, can be 
written so that the partial derivatives at lattice points are functions of the values 
of the closest neighbor points. For the sake of numerical accuracy, we have chosen 
a spatial mesh system where the points are arranged hexagonally. Furthermore, 
we found it convenient to express the velocity vectors in polar coordinates where 
their norm s and angle 9 can be represented on a rectangular mesh. The change of 
variable is given by p(s,9) = rp(v). In polar coordinates, the differential operator 
XtfJl becomes 



dv v dv 



1 9 d 2 p 2 1 2 d (I \ 1 2 d 2 p 1 
2 a ^d? K ' L 2°^ T) dr \r P ) + 2^ T W r 2 ' 



where a 2 L and a 2 T are the variances in the longitudinal and transverse directions, 
respectively. 

We let the activities of the motion-estimation cells at time t be represented by a 
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Figure 1: Network for Motion Estimation. The network consists of two interacting 
layers. The top square shows the observation layer consisting of cells organized 
in columns on the (x,y) lattice. At each of the twelve spatial positions there is a 
velocity column which we display by two cells, shown as circles, with the adjacent 
arrows indicating their velocity tunings (here either horizontal or vertical). The 
lower square represents the estimation layer, which also consists of cells organized 
as columns on the (x, y) lattice. In the measurement stage, the observation cells in- 
fluence the estimation cells by excitatory (indicated by the "plus" sign) connections 
and multiplicative interactions (indicated by x). The excitation is higher between 
cells with similar velocity tuning, which we indicate by strong (solid lines) or weak 
(dashed lines) connections. In the prediction stage, cells within the estimation layer 
excite each other (again with the strength of the excitation being largest for cells 
with similar velocity tuning). Inhibitory connections within the columns are used 
to ensure normalization. 1 a 




Figure 2: Network for Motion Estimation Using Continuous Prediction. (A) Contin- 
uous motion prediction (lower part of the network) is accomplished through nearest 
neighbor spatial interactions, in this case according to a hexagonal lattice. The 
velocity distribution is represented at each spatial position by a set of velocity cells 
(small circles) organized according to a polar representation (small panel). Ob- 
servation cells in measurement layer (upper part) influence the estimation cells by 
multiplicative interactions (indicated by x) to yield motion estimation a. (B) In- 
teractions between two populations of velocity cells involves cells tuned to the same 
velocities, while interactions within a population involves cells tuned to different 
velocities. 
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vector a(x,t) = (ai(x, t), «m(x, t)). The iterative method for determining such 
activities involves a four-steps scheme. The first three steps are concerned with 
solving Kolmogorov's equation. For a cell at position x = (x, y) and of velocity 
tuning = (s, 9), the three steps are: 



\yisfi — a x,y,s,9 + At^2 A,j( S ) at x,y,s+iAs,e+3A6 



t+1/4 



h3 

x+L,y+x, s ,< 



a xJ,sfi — a x,y!s,e + At ^2 ^2 ^,A S ) at x,y,s'+iAs,e'+jAe^ (^) 

s',6 1 hJ 

where the coefficients A hJ and B hX are functions of r, 9, and the covariance ma- 
trices. The constants As, A# represent the quantization factors for s and 9. The 
superscripts t+ 1/4, t + 1/2, t + 3/4 indicate the order in which these three steps 
proceed (i.e. a single time step for the complete system is broken down into four 
substeps for implementational convenience). 

The first step requires evaluating the spatial differential operator and involves 
six neighbor points (on the hexagonal spatial lattice). The second step calculates 
the velocity differential operator and involves four neighbor cells. Periodic boundary 
conditions are assumed in space and velocity. The third step performs the normal- 
ization. To guarantee stability of the whole iterative method, the time step At 
has been determined using von Neumann analysis (see Courant and Hilbert 1953), 
which considers the independent solutions, or eigenmodes, of finite-difference equa- 
tions. Typically, the stability conditions involve ratios between the time step and 
the quantization scales of space and velocity. 

If we are performing prediction without measurement, then the fourth step - 
the determination of the activity of the motion-estimation cells - simply reduces to 

Oft* = <yjr% (17) 

If there are measurements, then we apply the motion-estimation equation (Equa- 
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tion [7]), which consists of multiplying the motion prediction with the likelihood 
function (as defined by Equations M and [9j). Therefore the complete update rule is 



<Le = N ftt + l) a ™& ' ■ fi(M2M ( 18 ) 

where N(x, t + 1) is a normalization constant to impose X)«=i oc^x, t + 1) = 1, Vx, £ 
so that we can interpret these activities as the probabilities of the true velocities. 



5 Methods 

Our model has two extreme forms of behavior depending on how well the input 
data agree with the model predictions. If the data are generally in agreement with 
the models predictions, then it behaves like a standard Kalman ?filter (i.e., with 
gaussian distributions) and integrates out the noise. At the other extreme, if the 
data are extremely different from the prediction, then the model treats the data 
as outliers or distractors and ignores them. We are mainly concerned here with 
situations where the model rejects data. The precise situation where noise and 
distractors start getting confused is very interesting, but we do not address it here. 

We ?first checked the ability of our model to integrate out weak noisy stimuli for 
tracking a single dot moving in an empty background. For this situation, temporal 
grouping (data association) is straightforward, and so standard Kalman ?filters can 
be used. To evaluate the model, we ran a set of trials that plotted the relative 
estimation of two velocities as a function of the number of jumps. The graph 
we obtained (see Figure [3]) is very similar to those reported in the psychophysical 
literature (McKee el al. 1986). Henceforth, we assume that the model is able to 
integrate out noisy input. For the remaining experiments, we assumed that the local 
velocity measurements were noisy in the sense that the measurement units specified 
a probability of velocity measurements rather than a single velocity. However, the 
probability of velocities itself was not noisy. 
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We next tested our model on three psychophysical motion phenomena. First, 
we considered the improved accuracy of velocity estimation of a single dot over 
time (Snowden and Braddick, 1989; Watamaniuk el al. 1989; Welch et al. 1997; 
Ascher, 1998). Then we examined the predictions for the motion occlusion experi- 
ments (Watamaniuk and McKee, 1995). Finally we investigated the motion outlier 
experiments (Watamaniuk et al. 1994). 

For all simulations we set the parameters as follows. All dots were moving at 
a speed of 6 jumps per second (jps), where one jump corresponds to the distance 
separating two neighbouring pixels in the horizontal direction. The longitudinal 
and transverse components of the covariances matrices were a m :x,L = 0.8 jump, 
&m:x,T = 0.4 jump, (J m:v ,L = 3.2 jps, and a m : V ,T = 2.6 jps for the measurement, 
<?i:v,L = 2.2 jps, and <7i :VjT = 1.1 jps for the likelihood function, and a x ^ = 0.6 jump, 
<?x,t = 0.3 jump, a v> L = 0.8 jps, and a v T = 0.4 jps for the prior function. These 
parameters were chosen by experimenting with the computer implementation but 
the results are relatively insensitive to their precise values (i.e. detailed fine tuning 
was not required). 

Except for one of the occluder experiments (the occluder defined by distractor 
motion), where we used 18 velocity channels (six equidistant directions and three 
speeds), we used 30 velocity channels positioned at each spatial position, that is, 
six equidistant directions (thus A9 = 60 deg, starting at the origin), and five speeds 
(Ar = 2 jps, with slowest speed channel tuned to 2 jps). To guarantee stability of 
the numerical method we set the time increment to At = 0.6 ms. Initial conditions 
were uniformly distributed density functions. There were 32 spatial positions. 
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Figure 3: Speed Discrimination For Two Moving Dots. This graph shows the 
improved ability to estimate the relative speed of two dots (one of which is moving 
at 2 jumps per second) as a function of the number of jumps. The level of 80% 
correct discrimination is reached for diminishing speed differences (measured by 
dV/V) when the number of jumps increases, consistent with psychophysics (McKee 
et al. 1986). 
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6 Results 



6.1 Velocity Estimation in Time 

For this experiment, the stimulus was a single dot moving in an empty background. 
To evaluate how the velocity estimation evolves in time, we measured two num- 
bers: The first number was the sharpness of the velocity probability at each po- 
sition, which we define to be the negative of the entropy of the distribution (the 
entropy of the distribution is given by — Y^=\ otfj,(x, t) log a^x, t)) plus the maxi- 
mum of the entropy (to ensure that the sharpness is non-negative). Observe that 
the sharpness is the Kullback-Leibler divergence D(a\\U) between the velocity dis- 
tribution ctp(x, t) and the uniform distribution = 1/M, V \i (more precisely 
D(a\\U) = Y.ff=i a ^(x,t)\og{a^(x,t) /U^}.) As is well-known (Cover and Thomas 
1991) the Kullback-Leibler divergence is positive semi-definite taking its minimum 
value of when = U^, V/x and increases the more a M differs from the uniform 
distribution (i.e. the sharper the distribution a M becomes). Hence the higher this 
sharpness, the more precise the velocity estimate. Moreover, the sharpness will be 
lowest in positions where there are no dots moving (i.e. for which (^^(x, t) = 1/M, 
jj, — 1, • • • , M). The target, a coherently moving dot, should have a relatively high 
sharpness response surrounded by low sharpness responses in neighboring positions. 
The second number is the confidence factor P(<fi(x, t)\<&(t — 5)) in Equation [Tj This 
measures how probable the model judges the new input data. The higher this num- 
ber, the more the new measurement is consistent with the prediction (and hence 
the more likely that the measurement is due to coherent motion). 

The results of motion estimation for this experimental stimulus is shown in 
Figure |4| This figure shows an initial fast decrease in the entropy of the velocity 
distribution followed by a slow decrease (i.e. an increase in sharpness of the density 
function) . Also shown is the increase in the confidence factor of the target velocity 
estimation at each new iteration. These two effects indicate the increased accuracy 
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of the velocity estimation with each new iteration, consistent with the psychophys- 
ical literature (Snowdon and Braddick 1989, Watamaniuk el al. 1989, Welch et 
al. 1997, Ascher 1998). Note that for coherent motion the sharpness of the model 
increases rapidly at first and then appears to slow down. Our results suggest that 
the model reaches an asymptotic limit although they do not completely rule out 
a continual gradual increase. We argue that asymptotic behaviour is more likely 
as both the predictions and the measurements have inherent noise which can never 
be eliminated. Moreover, it can be proven that the standard (Gaussian) Kalman 
model will converge to an asymptotic limit depending of the variances of the prior 
and likelihood functions (and we verified this by computer simulations). It seems 
also that human observers also reach a similar asymptotic limit and their accuracy 
does not become infinitely good with increasing number of input stimuli. 

We also tested the estimation of velocity for a dot moving on a circular trajectory 
and our model can successfully estimate the dot's speed, although the sharpness 
and confidence are not as high as they are for straight motion (recall that the prior 
prefers straight trajectories). This again seems consistent with the psychophysics 
(Watamaniuk et al. 1994, Verghese et al. 1999). 

6.2 Single Dot With Occluders 

Next, we explored what would happen if the target dot went behind an occluding 
region where no input measurements could be made (we call this a black occluder). 
The results were the same as the previous case until the target dot reached the 
occluder. In the occluded region, the motion model continued to propagate the 
target dot but, lacking supporting observations, the probability distribution started 
to diffuse in space and in velocity. This dual diffusion can be seen in Figure [5] where 
the sharpness of velocity estimation is shown after the dot entered the occluding 
region. Observe the decrease in sharpness of the density function, indicating a 
degradation of velocity estimation, when the target is behind the occluder. However, 
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Figure 4: Single Dot Experiment. A single dot is moving with constant velocity. 
Increasing accuracy of the velocity estimate with time is visible by an increase in 
the confidence factor and in the sharpness of the density function. Observe how 
the both confidence and sharpness increase rapidly at first but then quickly start to 
flatten out. (Our model outputs a datapoint at each time frame and our plotting 
package joins them together by straight-line sgements.) 
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the model still has enough "momentum" to propagate its estimate of the target dot's 
position even though no measurements are available (this will, of course, breakdown 
if the occluder gets too large). This is consistent with the findings by Watamaniuk 
and McKee (1995) who showed that observers had a higher than average chance of 
detecting the target when it emerges from the occluder into a noisy background. 

We then tested our model with motion occluders, occluders defined by motion 
flows as described in Watamaniuk and McKee (1995), see Figure [6]A We use the 
same measures as for the single isolated dot. We also plot the cells' activities as a 
function of the direction tuning for the cells tuned to the optimal speed (Figure [6^3). 
The plot indicates that the cells can signal non-zero probabilities for several motions 
at the same time. After entering the occluding region, the peak corresponding 
to the target motion gets smaller than the peak due to the occluders. However, 
the peak of the target remains and so the target peak can rapidly increase as the 
target exits from the occluders. Our model predicts that is easier to deal with black 
occluders than with motion occluders which is not consistent with the psychophysics 
that shows a rough similarity of effects for both types of occluders. We offer two 
possible explanations for this inconsistency: First, the model's directionally selective 
units and/or prediction-propagation mechanism have too broad a directional tuning. 
Narrowing it may lead to weaker interactions between the motion occluder and 
the target. Second, in contrast with the black occluders, the motion occluders 
may group together as a surface, which would help explain why the visual system 
may be more tolerant of motion occluders. (Such a heightened tolerance might 
compensate for the initial putative advantage of the black occluders.) Kanizsa 
(1979) demonstrates several perceptual phenomena, which appear to require this 
type of explanation. A limitation of our current model is that it does not include 
such effects. 
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Figure 5: Single Dot Experiment With Black Occluder. (A) This Representation 

shows how the spatial blur and sharpness of the velocity distribution change with 

time (measured in time frames) when a single dot moving along the Y-axis gets 

occluded. The occluding area is from Y = 15 up to Y = 22. (B) The Y-location of 

the peak of velocity distribution as a function of time. Motion inertia (momentum) 

keeps the peak moving during the occlusion, albeit with a tendency to slow down 

(as visible by the wider plateau around time frame 20). 
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Figure 6: Single Dot Experiment With Occluders Defined by Distractor Motion. 
(A) The occluder is defined by vertical motion of distractor dots, visible as circles 
in the left-top frame with arrows indicating velocity (the dots's speed is 6 jumps 
per second, and is represented by the length of the lines). The target motion, not 
shown because it is hidden by the occluder, is moving horizontally from left to right. 
The motion-inertia effect of the target motion on the distractors is shown in the 
estimation and prediction stages where the height and width of the rectangles at 
each point indicate the confidence and sharpness, respectively (a small rectangle 
indicates low confidence and sharpness). The lattice, marked with small dots, is 
hexagonal for the measurement, estimation, and prediction stages. (B) The prob- 
ability distribution of the velocity cells tuned to 6 jumps per second plotted as a 
function of directional tuning at different time steps. As the target dot enters the 
occluder, at time frame 10, two peaks ^^rt developing in the probability distri- 
bution. The bigger the occluder the more the peak induced by the motion of the 
distractor dots starts dominating. But as the dot re-emerges from the occluder it 
rapidly becomes sharp again, time frame 12. 



6.3 Outlier Detection 

In the outlier detection experiments (Watamaniuk et al. 1994, Verghese et al. 1999), 
the target dot is undergoing regular motion but it is surrounded by distractor dots, 
which are undergoing random Brownian motion. To show the velocity estimation at 
each position, we plot the response of our network using a rectangle to display the 
two properties of sharpness and confidence. The width of the rectangle gives the 
sharpness of the set of cells at that point and the height gives the confidence factor. 
The arrow shows the mean of the estimated velocity. It can be seen (Figure [7|) that 
the target dot's signal rapidly reaches large sharpness and confidence by comparison 
to the distractor dots, which are not moving coherently enough to gain confidence 
or sharpness. The sharpness of the target dot's signal does not grow monotonically 
because the distractor dots sometimes interfere with the target's motion by causing 
distracting activity in the measurement cells. 

7 Discussion 

This work provides a theory for temporal coherence. The theory was formulated 
in terms of Bayesian estimation using motion measurements and motion predic- 
tion. By incorporating robustness in the measurement model, the system could 
perform temporal grouping (or data association), which enabled it to choose what 
data should be used to update the state estimation and what data could be ig- 
nored. We also derived a continuous form for the prediction equation, and showed 
that it corresponds to a variant of Kolmogorov's equations at each node. In deriv- 
ing our theory, we made an assumption of spatial factorizability of the probability 
distributions and we made the approximation of updating the marginal distribu- 
tions of velocity at each point. This allowed us to perform local computations and 
simplified our implementation. We argued that these assumptions/ approximations 
(as embodied in our choice of prior probability) are suitable for the stimuli we are 
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Figure 7: Outlier Detection. A target dot (shown as a circle with line pointing 
rightwards indicating its velocity) is half way down the stimulus frame moving from 
left to right. It is surrounded by distractor dots undergoing random Brownian 
motion. After a few time steps the model gives high confidence and sharpness for 
the target dot. Observe that motion estimates of the distractor dots can sometimes 
become sharp if their motion direction is not changing too radically between two 
time frames (see the large rectangles at the bottom of the estimation and prediction 
panels. Graphic conventions as in previous figure. 
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considering, but that would need to be revised to include spatial coherence effects 
(Yuille and Grzywacz 1988, Yuille and Grzywacz 1989). The prior distribution used 
in this paper would need to be modified to incorporate these effects. 

In addition, recent results by McKee and Verghese (personal communication) 
also suggest that that a more sophisticated prior may be needed. Their results seem 
to show that observers are better at predicting where there is likely to be motion 
rather than what the velocity will be. Recall that our prior probability model has 
two predictive components based on the current estimation of velocity at each image 
position: First, the velocity estimation is used to predict the positions in the image 
where the velocity cells would be excited. Second, we predict that the new velocity 
is close to the original velocity. By contrast, McKee and Verghese's results seem to 
show that human observers can cope with reversals of motion direction (such as a 
ball bouncing off a wall). This is a topic of current research and we note that priors 
which allow for this reversal have already been developed by the computer vision 
tracking community (Isard and Blake 1998). 

Simulations of the model seem in qualitative agreement with a number of exist- 
ing psychophysical experiments on motion estimation over time, motion occluders 
and motion outliers (McKee et al. 1986; Watamaniuk and McKee 1995; Watama- 
niuk et al. 1994). Such a qualitative agreement is characterized by three main 
features: (i) the final covariance of the motion estimate, (ii) the number of jumps 
needed to reach such a final covariance, and (iii) the extent of motion inertia during 
an occlusion, or between two measurements. These three features are controlled as 
follows: the likelihood's covariance matrix determines the initial distribution, and 
thus the number of jumps to reach the desired covariance, while the prior's covari- 
ance matrix affects all three features by imposing an upper bound to the covariance 
and by setting the time constants of the diffusion process between measurements. 

Implementation of the model's equations led to a network that shares interesting 
similarities with known properties of the visual cortex. It involves a columnar 
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structure where the columns, at each spatial point, are composed of a set of cells 
tuned to a range of velocities (columnar structures exist in the cortex but none have 
yet been found for velocities). The observation-cell layer involves local connections 
within the columns to compute the likelihood function. The estimated velocity 
flow is computed in a second layer. If these layers exist, it would be natural for 
them to lie in cortical areas involved with motion such as MT and MST (Maunsell 
and Newsome 1987, Merigan and Maunsell 1993). The estimation layer carries out 
calculation according to Kolmogorov's equation, or the discrete long range version 
described in (Yuille et al 1998), and consists of spatial columns of cells which excite 
locally each other to predict the expected velocities in the future. The calculation 
requires excitatory and inhibitory synapses (i.e. no "negative synapses"). The 
inputs from the observation layer multiply the predictions in the estimated layer. 
Finally, we postulate that inhibition occurs in each column to ensure normalization 
of the velocity estimates at each column. This normalization could be implemented 
by lateral inhibition as proposed by Heeger (1992). 

A difficult aspect of this network, as a biological model, is its need for multiplica- 
tions (such multiplications arise naturally from our probabilistic formulation using 
the rules for combining probabilities). Neuronal multiplication mechanisms were 
argued for by Reichardt (1961), Poggio and Reichardt (1973, 1976) on theoretical 
grounds. A specific biophysical model to approximate multiplication was proposed 
by Torre and Poggio (1978). Detailed investigations of this model, however, showed 
that it provided at best a rough approximation for multiplication (Grzywacz and 
Koch 1987). Moreover, experiments showed that motion computations in the rabbit 
retina, for which the model was originally proposed, were not well approximated 
by multiplications (Grzywacz et al. 1990). Recent related work (Mel et al. 1988, 
Mel 1993), though arguing for multiplication-like processing, has not attempted to 
attain good approximations to multiplication. On the other hand, the complexities 
of neurons make it hard to rule out the existence of domains in which they perform 
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multiplications. Overall, it seems best to be agnostic about neuronal multiplication 
and trust that more detailed experiments will resolve this issue. More pragmatically, 
it may be best to favour neuronal models which correspond to known biophysical 
mechanisms. The biophysics of neurons may prevent them from performing "per- 
fect" Bayesian computations and perhaps one should instead seek optimal models 
which respect these biophysical limitations. 

Finally, we emphasize that simple networks of the type we propose can imple- 
ment Bayesian generalizations of Kalman filters for estimation and prediction. The 
popularity of standard linear Kalman filters is often due to pragmatic reasons of 
computational efficiency Thus linear Kalman filters are often used even when they 
are inappropriate as models of the underlying processes. In contrast, the main draw- 
back of such Bayesian generalizations is the high computational cost, even though 
statistical sampling methods can be used successfully in some cases (see Isard and 
Blake 1996). Our study shows that these computational costs may be better dealt 
with by using networks with local connections and we are investigating the possi- 
bility of implementing our model on parallel architectures in the expectation that 
we will then be able to do Bayesian estimation of motion in real time. 
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Appendix 



In this Appendix, we derive Kolmogorov's equation for motion prediction. Let con- 
sider Equation [6] for 5 — > 0: Taking this limit is tricky and requires Ito calculus 
(Jazwinski 1970). Our case is also non-standard because our theory involves prob- 
ability distributions at every point in space, which interact over time. This means 
that we cannot simply use the standard Kolmogorov's equations, which apply to 
a single distribution only. Here we give a simple derivation, however, which uses 
delta functions and is appropriate when the prior model is based on a Gaussian 
The prior specified by equations 11, 12 is written as: 



P(v&t + 6)\{v'(x*,t)}) = j dx* Jdv'(?)G(y(x,t + 6) -v(x ,t) : 5S V ) 

G(x-x'-5v'(x',t):6J: A ) 
X fdx" J dv"(x")G(x- x" -&?"(£", t);5£ s )' [ ' 

The normalization term in the denominator is used to ensure that the probabilities 
of the velocities integrate to one at each spatial point x. In this equation, we have 
scaled the covariances S by 5. This is necessary for taking the limit as 5 i— > 
(Jazwinski 1970). 

Between observations we can express the evolution of the prior density P(v(x, t)) 

as: 



dP(v(x,t)) r P(v(x,t + a)|{ff(i?,t)}) - P(v(x,t)) ^ 

dt = te< s } 

= lim T {/ dx' [ dv'(x")P(v(x,t + 5)\{^(x\t)})P({tr(x",t)}) - P(v(x,t))\ (A2) 

<5m>0 J J 

We now perform a Taylor series expansion of P(v(x, t + 5)\{v'(x', £)}) in powers 
of 5 keeping the zeroth and first order terms (higher order terms will vanish when 
we take the limit as 5 t— > 0). To perform this expansion, we use the assumption 
that this distribution is expressed in terms of Gaussians. As 5 i— > these Gaussians 
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will tend towards delta functions and the derivatives of Gaussians will tend towards 
derivatives of delta functions (thereby simplifying the expansion). This derivation 
can be justified rigourously, playing detailed attention to convergence issues, by 
the use of distribution theory or the application of Ito calculus. If we expand 
G(v(x, t + 5) — v'(x', t) : <5£-v) about 6 = we find that the zeroth order term is a 
Dirac delta function with argument v(x,t + 5) — v'(x',t). This term can therefore 
be integrated out. The derivative with respect to S will effectively correspond to 
differentiating the Gaussian with respect to the covariance. By standard properties 
of the Gaussian this will be equivalent to a second-order spatial derivative. We will 
get similar terms if we expand G(x — x' — 5v'(x',t) : 5~S^), but we will also get an 
additional drift term from differentiating the Sv ; (x',t) argument. In addition, we 
will get other terms from the denominator of Equation |A1| (These are required 
to normalize the distributions and are non-standard. They are needed because we 
have a differential equation for a set of interacting probability distributions while 
the standard Kolmogorov's equation is for a single probability distribution.) We 
collect all these zeroth and first order terms in the expansion and substitute into 



Equation A~2 We can then evaluate the integrals using known properties of the delta 
functions. After some algebra, these integrals yield our variant of Kolmogorov's 
equation: 



-v- ~P(v(x,t)) + ^ J vP(v(x,t))dv. (A3) 

where and are scalar differential operators, and |V| is the 

volume of velocity space / dv. 
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